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Abstract 

Genome-wide association studies (GWASs) have recently revealed many genetic associations that are shared between 
different diseases. We propose a method, disPCA, for genome-wide characterization of shared and distinct risk factors 
between and within disease classes. It flips the conventional GWAS paradigm by analyzing the diseases themselves, across 
GWAS datasets, to explore their "shared pathogenetics". The method applies principal component analysis (PCA) to gene- 
level significance scores across all genes and across GWASs, thereby revealing shared pathogenetics between diseases in an 
unsupervised fashion. Importantly, it adjusts for potential sources of heterogeneity present between GWAS which can 
confound investigation of shared disease etiology. We applied disPCA to 31 GWASs, including autoimmune diseases, 
cancers, psychiatric disorders, and neurological disorders. The leading principal components separate these disease classes, 
as well as inflammatory bowel diseases from other autoimmune diseases. Generally, distinct diseases from the same class 
tend to be less separated, which is in line with their increased shared etiology. Enrichment analysis of genes contributing to 
leading principal components revealed pathways that are implicated in the immune system, while also pointing to 
pathways that have yet to be explored before in this context. Our results point to the potential of disPCA in going beyond 
epidemiological findings of the co-occurrence of distinct diseases, to highlighting novel genes and pathways that 
unsupervised learning suggest to be key players in the variability across diseases. 
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This is a PLOS Computational Biology Methods article. 
Introduction 

Comorbidity studies show that some distinct diseases tend to co- 
occur [1-6], pointing to a shared genetic and/or environmental 
component. In the era of genome-wide association studies 
(GWASs), direct evidence of shared genetic risk factors of diseases 
comes to light [7]. For example, while it has been previously shown 
that rheumatoid arthritis and type-1 diabetes co-occur [1], GWASs 
have identified 12 genes associated with both diseases [8-16]. More 
broadly, disease genes obtained from the Online Mendelian 
Inheritance in Man [1 7] were used to assemble the Human Disease 
Network (HDN) [1 8, 1 9] , a visual representation of genetic similarity 
between diseases. Pleiotropy of complex diseases and traits has also 
been explored by searching genome-wide for variants implicated in 
more than one disease [16,20,21]. Such studies promise to reveal 
shared genes and offer an expanded understanding from a genetic 
standpoint of why some diseases tend to co-occur. 

Methods for exploring shared genetic risk factors between 
diseases belong to two main categories (see also recent review [7]). 
The first category of methods focuses on finding individual 
variants that are associated with a pair or more of diseases being 
investigated. In one set of such methods, a GWAS is carried out on 



a pooled set of individuals with different diseases [10,16,20,21], or 
by analyzing information for multiple diseases available for the 
same individuals [22,23]. Alternatively, and based only on 
summary statistics of the association test for each single nucleotide 
polymorphism (SNP), one can simply combine p-values from 
several GWASs using Fisher's method [24]. The CPMA (cross- 
phenotype meta-analysis) statistic [25] is another statistic that tests 
whether a SNP is associated to more than one phenotype. In 
addition, methods such as the conditional false discovery rate or 
mixed-models for multiple traits have used known pleiotropy 
between diseases or traits to increase power [26,27]. Studies 
employing these methods have found shared associations between 
pairs of diseases such as Crohn's disease and celiac disease [16], 
other autoimmune disease pairs [20,21], bipolar disorder and 
schizophrenia [26] and multiple sclerosis and schizophrenia [28]. 
They have additionally shown that SNPs associated with one 
autoimmune disease are likely to be associated to other (though 
not all) autoimmune phenotypes [25]. 

The second category of methods focuses on using shared 
variants to learn about the genetic similarity between diseases. 
One method employed by Sirota et al. utilizes the correlation 
between association signals across many SNPs to assess the 
similarity between pairs of diseases and showed that there are 
likely two distinct autoimmune classes where a risk allele for one 
class may be protective in another [29] . Similar methods based on 
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Author Summary 

Epidemiological studies have revealed distinct diseases 
that tend to co-occur in individuals. As genome-wide 
association studies (GWASs) have increased in numbers, 
more evidence regarding the genetic nature of this shared 
disease etiology is revealed. Here, we present a novel 
method that utilizes principal component analysis (PCA) to 
explore the relationships and shared pathogenesis be- 
tween distinct diseases and disease classes. PCA groups 
and distinguishes between data points by uncovering 
hidden axes of variation. Applying PCA to 31 GWASs of 
autoimmune diseases, cancers, psychiatric disorders, neu- 
rological disorders, other diseases and body mass index, 
we report several findings. Diseases of similar classes are 
located near each other, supporting the genetic compo- 
nent of shared disease etiology. Genes that contributed to 
distinguishing between diseases are enriched for various 
pathways including those related to the immune system. 
These results further our knowledge of the genetic 
component of shared pathogenesis, highlight possible 
pathways involved and provide new guidelines for future 
genetic association studies. 

classifier [30] and linear mixed model approaches [27,31] have 
also been proposed for assessing the shared genetic variation 
between two diseases. 

These exciting new methods are powerful for studying shared 
genetic risk variants between diseases. At the same time, 
overcoming some of their limitations can improve the study of 
shared pathogenesis using data from multiple GWASs. First, some 
methods have focused on analysis of individual SNPs. Though well 
suited for scenarios of a single causal SNP in a locus, such methods 
would suffer a reduction in power when several causal SNPs exist 
or if different SNPs tag the same underlying causal variant, which 
is especially relevant for diseases with rare causal variants [32,33] 
and when the different GWASs are across different populations 
[34] or have used different genotyping arrays. Second, when 
considering the correlation between association statistics of 
different studies, it might be beneficial to not consider all variants 
equally (as is the case in [29]), whether or not they play a role in 
disease susceptibility. Third, most methods assume as known 
which diseases share pathogenesis, and while the shared patho- 
genesis of autoimmune disease has been well established [25,29], it 
is worthwhile to study shared pathogenesis of other disease classes 
[6,35,36]. And fourth, while some approaches perform well for 
two correlated traits or diseases, extending the analysis to more 
than two traits can become difficult [27]. 

In this study, we present a novel method, disPCA, which uses 
principal component analysis (PCA) to learn about the shared 
genetic risk of distinct diseases. PCA maps data from the original 
axes into new axes in principal component (PC) space via a stretch 
and rotation of the original axes. Each new axis or PC captures the 
maximal level of variation in the data not captured by previous 
PCs. Thus, each PC can potentially capture a different, orthogonal 
story told by the data. Our method is based on summary level 
statistics from GWASs of different diseases. We combine data from 
individual SNPs into gene-based statistics via several p-value 
combination methods. PCA is applied to a matrix across genes and 
GWAS datasets, with entries representing the strength of 
association between a gene and the disease studied in a dataset. 
Thus, disPCA reveals principal components that are linear 
combinations of all genes, weighed in accordance with their role 
in differentiating between the different GWASs. It can be applied 



to study multiple diseases without prior knowledge of their shared 
pathogenesis, thereby overcoming all the limitations of existing 
methods outlined above. disPCA also accounts for potential 
confounders due to methodological differences between studies, 
such as in genotyping array, which can otherwise lead to these 
differences being captured by the PCA. 

Equipped with this novel method and with data from 3 1 GWAS 
datasets, we considered the level of shared pathogenesis between 
diseases and classes of diseases from all genes, which we term 
shared pathogeneses. Diseases with more similar underlying 
genetics are more likely to be located closer together in PC space. 
As PCA is a non-parametric method, it makes no assumptions 
regarding which diseases are more similar and does not aim to 
model it, thereby allowing discovery of new relationships between 
diseases by examining the top PCs. Each PC captures a different 
combination of genes that distinguish well between some diseases, 
or the remaining variation between diseases. No separation 
between diseases along a PC indicates that they tend to share 
the pathogenetics underlying that PC. By studying the set of genes 
underlying each PC for enrichment in specific pathways, we 
further assessed the function and relationship of genes that 
separate different disease clusters in PC space. 

Materials and Methods 

disPCA 

We developed a method, disPCA, for studying the relationship 
between diseases based on their level of disease risk genes shared. 
The method works on the gene-level by first combining 
information from all SNPs in and around each gene. Considering 
gene-level statistics compensates for different tag SNPs being 
associated in different datasets even in cases where they capture 
the same causal variant. It also aggregates information across 
multiple tag SNPs in each dataset, as well as allows for different 
underlying causal variants in the same gene being associated with 
the risk of different diseases. To be widely applicable, disPCA is 
based solely on the p-values of association of each SNP with the 
disease under study. Importantly, all SNPs and consequently all 
genes are considered, rather than focusing on genes that meet a 
genome-wide significance level of association with a disease. We 
apply PCA to many different GWASs to axiomatically find and 
assign importance to genes based on their contribution to 
distinguishing between diseases and disease classes. The ensuing 
distance between different disease datasets in PC space inversely 
corresponds to their level of shared pathogenetics. 

Gene-level significance levels 

For each protein-coding gene from the HGNC database [37], we 
mapped all SNPs that are in the gene or within 0.01 cM from it 
(genetic distances were determined via the Oxford genetic map 
based on HapMap2 data [38,39]). We discarded all SNPs that were 
not mapped to within 0.01 cM of any gene. If a SNP lay between 
two genes, it was assigned to the closer gene. For each GWAS 
dataset, we determined the significance of association of each gene 
with the assayed disease using the following simulation procedure. 
Let the observed p-value of a gene be the minimum p-value of the n 
SNPs mapped to the gene. We compared the observed p-value to 
that of 100,000 groups of n consecutive SNPs chosen in random. 
Based on these groups, we assign a new p-value to each gene as the 
proportion of groups for which the observed minimum p-value for 
that gene is less significant than that of the group. This random 
sampling procedure may be biased in regions of high linkage 
disequilibrium (LD) when mapping SNPs to genes using genetic 
distance (e.g. consecutive SNPs in regions of high LD will be more 
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correlated than those in regions of lower LD). However, for any 
given gene, these will equally affect each of the datasets. To validate 
this, we also applied disPCA to p-values obtained from mapping 
SNPs to genes using physical distance: a SNP was mapped to a gene 
if it was in the gene or within 10 kb of it. Comparing these results to 
results based on mapping via genetic coordinates revealed the same 
clustering of diseases (Figure SI). Furthermore, in studying the 
loading of each gene, namely their contribution to each PC, we 
found that the genes with the top 50 average loadings on the first 
two PCs were significandy correlated (r>0.67, p-value<8.4x 10 8 , 
Table SI). Thus, in the main text we present results based on 
mapping by genetic distance as described above. 

To consider information from beyond only the most significant 
SNP in a gene, we also implemented the truncated tail strength 
[40] and the truncated product methods [41] to combine p-values 
in each gene in replacement of the minimum p-value, and 
followed a similar procedure for assigning new gene-level p-values. 
For the analyses presented in the following, results from all 
methods were similar though results with the minimum p-value 
approach clusters similar diseases better (Figure S2, S3). We thus 
only report in the main text results from the minimum p-value 
approach. Code to carry out this procedure is publicly available at 
http://keinanlab.cb.bscb.cornell.edu/content/tools-data. 

PCA implementation and confounders 

Assume a matrix Z, adxg matrix of the — logi 0 gene-level p- 
values, where d is the number of GWAS datasets, and g is the 
number of genes present in all datasets. We center the matrix by 
subtracting the column means from each column. Thus the 
centered matrix B has entries: 

d 

««v=^-^y- (i) 

To obtain the PCs of matrix B, we must find the eigenvectors 
and eigenvalues of its covariance matrix BB T . Let v, be a vector of 
length d and let A,- be a scalar. Vi is the eigenvector and X the 
eigenvalue of BB T if the following is satisfied: 

(BB T )v,=A iVi (2) 

The principal components of B are the normalized eigenvectors 
of its covariance matrix, BB T , where the eigenvectors are ordered 
such that the largest eigenvalue corresponds to the first principal 
component. Each eigenvector is additionally orthogonal to all 
other eigenvectors. Thus, from (2), we can decompose BB T as 
follows: 

BB T = uJ2 uT ( 3 ) 

Where the columns of U contain the principal components and 
£ is a diagonal matrix with entries equal to the eigenvalues of 5's 
covariance matrix. One can similarly construct the singular value 
decomposition (SVD) of B. The SVD of B can be written as: 

B=VDW T (4) 

where V is a d xd matrix, D is a d xg diagonal matrix, and W is a 
g xg matrix. V and W contain the left and right singular vectors of 



B, respectively, and D contains the singular values of B in its 
diagonal. Substituting equation (4) for B in equation (3), we find 
that 

BB T = (VDW T )(WDV T )=VD 2 V T =Uj2 uT ( 5 ) 

Thus, the principal components of B, the eigenvectors of its 
covariance matrix, are equivalent to the left singular vectors of B. 
In addition, the eigenvalues of B are equivalent to the square of its 
singular values. 

We applied SVD to the matrix B using the R [42] 
implementation of PCA/SVD (prcomp), with no scaling of the 
data. Due to the heterogeneity of the GWAS datasets (Table S2), 
variation uncovered by PCA can also reflect differences in features 
such as genotyping array, association method, and sample size, 
rather than underlying disease risk genes. To ensure that these 
features did not influence our results, we first tested each gene for 
association with each of these features. Let Z; = Z, . be the vector 
corresponding to the association statistic for gene i across the d 
datasets. We considered a linear regression of Zi as a function of the 
covariates: z, ■ = a + b- u \ C\ + bn Ci + &/,3 C3 + e, where Cj, C 2 , C3 
are vectors of length d that represent the genotyping array, 
association method and the logio of the sample size respectively, in 
each of the studies (Table S2). Testing the significance of 
regression coefficients can reveal genes that are associated with 
any of these potential confounders. In our following analysis, 19 
genes were significantly associated with association method. 
However, genes not significandy associated to the above 
confounders may similarly have an effect. Hence, we also applied 
SVD (as described above) to the residualized matrix, namely 
matrix R with rows 7?,,, = z ( — (a + bi y \C\ + bi t 2C2 + bi^C{). We 
found that applying SVD to R results in the top PCs capturing a 
higher fraction of the variance of the data than when applied to 
the original matrix Z, though results are qualitatively similar 
between the two. We thus present results derived from the 
residualized matrix R. Resulting distances between datasets were 
assessed visually by plotting datasets in PC space. To quantify the 
clustering of datasets, we additionally applied hierarchical 
clustering in R [42] (hclust) to the Euclidean distance between 
pairs of datasets across the first two PCs. 

Simulation study 

We simulated a matrix Z for two disease classes, each with 5 
diseases (A 1 ,A 2 ,A 3 ,A 4 ,A S ,B 1 ,B 2 ,B 3 ,B 4 ,B S ) and 10,000 genes. In 
general, under the null hypothesis of a region containing no risk 
variant and assuming no confounding factors (e.g. population 
stratification), p-values should be uniformly distributed between 0 
and 1 . On the other hand, associated risk variants should be enriched 
for smaller p-values. We thus considered three sets of genes. The p- 
values for the first set of genes was drawn from the (7(0, 1) distribution 
for all diseases, thus no pleiotropy was captured in this set of genes. 
The second set of genes was distributed (7(0,0.05) for the first disease 
class (A],...,Aj) and distributed (7(0,1) for the second disease class 
(B],...,B S ). Finally the third set of genes was distributed (7(0,0.05) for 
the following diseases: Aj, A 2 , Bj, B 2 and distributed (7(0,1) for all 
other diseases. Thus the second set of genes simulates pleiotropy 
between diseases in disease class A, while the last set of genes 
simulates pleiotropy between diseases in both disease classes. 

Disease and pathway enrichment analysis 

Disease enrichment analysis was completed using the online tool 
WebGestalt [43,44] to query the PharmGKB [45] database. 
WebGestalt tests for enrichment of a category of genes in the 
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observed set of genes using the hypergeometric test [43]. 
Bonferroni correction for multiple tests was applied and all 
reported p-values are following this correction. We restricted 
analysis to categories that contained a minimum of 5 genes in our 
analysis with the largest 50 weightings in the top two PCs. For 
gene categories with overlapping or the same set of genes, we list 
the most significant category. To reduce biases introduced by the 
clustering of genes with similar function, we filtered our list of 
genes with the top 50 loadings on the top two PCs by removing the 
latter gene out of a pair of genes within 0.1 cM of each other. We 
then applied WebGestalt to this filtered subset of genes. 

Pathway enrichment analysis was completed using the Gene Set 
Enrichment Analysis (GSEA) tool [46]. GSEA sorts genes 
according to a score, which here is the weighting of a gene in 
the PC under study. It then assesses whether genes belonging to a 
certain category (e.g. pathway) are non-randomly distributed in 
the sorted list. As input to GSEA, we utilized the weights of genes 
in the top two PCs. GSEA carried out 10,000 gene-set 
permutations to determine FDR (false discovery rate) q-values. 
We queried the BioCarta and KEGG pathway databases. We 
restricted analysis to categories that contained a minimum of 5 
genes in our analysis. Throughout we present enrichment analysis 
only for the top two PCs, though other PCs are available and can 
be assayed for further insight into the diseases studied. We 
considered an FDR of 0.25, suggested by GSEA [46] (GSEA 
manual online), though this entails that 1 in 4 of our results are 
false positives on average. As above, to reduce biases introduced 
by the clustering of genes with similar function, we filtered our full 
list of genes by removing the latter gene out of a pair of genes 
within 0. 1 cM of each other and reanalyzed this subset of genes 
(n = 5,298) with GSEA. 

Testing for non-random distribution of p-values 

We followed a similar approach to that implemented in 
Zhernakova et al. 201 1 [21] while applying it to genes instead of 
individual SNPs to test for non-random distribution of association 
values. For each disease pair we retained all k genes that were 
nominally significant in one disease (p-value<0.01). We then 
tested the null hypothesis of a uniform distribution of p-values in 
the second disease using Fisher's method for combining p-values: 
k 

X2/ f = — 2 Yl l n (P;)> where pi is the p-value for association of gene i 
i=i 

in the second disease. Nearby genes in linkage disequilibrium may 
violate the independency assumption in Fisher's method. We thus 
performed a separate analysis after removing the latter of the two 
genes that were within 0.1 cM of each other and nominally 
significant in one disease. 

Application of disPCA to 31 GWAS datasets 

We analyzed a total of 31 GWAS datasets [10,47-76] that 
spanned different types of cancers, autoimmune diseases, neuro- 
logical disorders, psychiatric disorders, type-2 diabetes (T2D), 
ischemic stroke and body mass index (BMI) (Table S2). Datasets 
were publicly available, obtained from dbGaP or obtained via 
collaborations. These datasets had non-overlapping samples and 
were of European ancestry only. For Wellcome Trust Case 
Control (WT) related datasets, we distributed controls between the 
five datasets such that none had overlapping samples. For WT 
type-1 diabetes, rheumatoid arthritis and Crohn's disease, we 
obtained further controls from the WT hypertension, cardiovas- 
cular disease and bipolar disorder case data [10]. After obtaining 
gene-level association statistics for 14,018-17,438 autosomal genes 
for each dataset, we limited our analysis to the 11,927 genes that 



overlapped all studies. Nineteen of these genes were significandy 
associated with association method after multiple-testing correc- 
tion (see above). 

Replication of disPCA 

We tested the replicability of disPCA when applied to real 
GWASs using six datasets for which we had access to the original 
data [10,57,60,61,74,75]. Each dataset was split into independent 
subsets of equal size (+/ — two samples). We then used PLINK's 
logistic regression [77] to evaluate association of each SNP to 
disease risk. We additionally incorporated covariates derived from 
EIGENSOFT into the regression analysis [78] to control for 
population structure. We randomly chose one subset of each of the 
six datasets for one disPCA analysis, and the rest for another. 
Hence, these two analyses consist of independent samples. 

Results 

We first applied disPCA to a simulated dataset (Materials and 
Methods). We varied the number of genes that have correlated 
association results across simulated datasets, thereby varying the 
level of pleiotropy between the simulated diseases. disPCA clearly 
clustered pleiotropic diseases when diseases shared at least 40 
shared genes with p-values randomly distributed below 0.05 in 
each disease (Figure la-b, S4, S5, S6). This can be seen both 
visually via PCA plots, and via hierarchical clustering based on the 
Euclidean distance between datasets in the presented space of the 
first two principal components (PCs) (Figure 1, S4, S5, S6). When 
diseases are indeed clustered by their simulated pleiotropy 
according to disPCA (Figure lb), the first two PCs explain a 
similar fraction of the variance (Figure lc), which may increase or 
decrease depending on the number of genes contributing to 
pleiotropy (Figure S7). We next examined the contribution of each 
gene to each PC as captured by its absolute "loading". 
Considering the first two PCs in this disPCA analysis, genes with 
p-values<0.05 (Materials and Methods) are also enriched for 
larger absolute loadings, stressing their role in differentiating 
between the simulated disease classes (Figure ld-e). 

We next applied disPCA to empirical data from GWAS datasets. 
First, we considered only diseases for which we had two datasets: 
autoimmune diseases (for which we had the most pairs of datasets) 
and a pair of schizophrenia datasets (as schizophrenia has a high 
heritability [79]). We observed that datasets of the same diseases 
were generally clustered together (Figure 2-3). We additionally 
observed that Crohn's disease is separated from other autoimmune 
diseases. This result is consistent with previous reports that 
inflammatory bowel disorders (IBDs) are distinct from other 
autoimmune disorders [29]. As in the simulated scenarios, the 
variance explained by each PC was similar (Figure 2b), and the 
results suggest that less than a hundred genes contribute to the 
similarity between each pair of datasets (Figure 3c-d). 

To test the replicability of the results, we further divided each of 
the six datasets, for which we had the raw data, into two subsets 
consisting of the same or similar number of cases and controls 
(Materials and Methods). We then performed two disPCA 
analyses, one based on a randomly chosen subset of each of the 
six datasets, and another based on the remaining subset of each 
dataset. We found that both independent sets produced the same 
clustering of diseases (Figure S8, S9). Loadings for 50 genes with 
the largest average loading across the two disPCA analyses of PC 1 
and PC2 were also significantly correlated across the two (r>0.44, 
p-value<1.2xl0" 3 , Table S3). These results point to disPCA 
capturing some of the same pleiotropy in both cases, and further 
support the replicability of its results. 
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Figure 1. disPCA of ten simulated diseases. The p-values for ten diseases were simulated for 10,000 genes (Materials and Methods). Class A 
diseases had p-values uniformly distributed between 0 and 0.05 for 40 of the 10,000 genes, while two diseases from class A (A_1, A_2) and two 
diseases from class B {B_1, B_2) had p-values similarly distributed for a separate set of 40 genes (Materials and Methods). All other genes had p-values 
that were randomly distributed in each disease between 0 and 1. A) The simulated data is displayed on PCI and PC2. PC1 separates (A_1, A_2, B_1, 
B_2) from all other diseases, while PC2 separates class A diseases from class B diseases. B) Dendrogram derived from a clustering analysis based on the 
Euclidean distance between datasets in the space of the first two PCs (represented as the height of the branches). C) PC1 and PC2 account for a 
similar amount of variance. D) Loadings for each gene are displayed sequentially for PC1 . The 40 genes contributing to pleiotropy between the two 
diseases in each class (displayed as the last 40 genes) are enriched for larger absolute loadings. E) Similar to (D), with loadings for PC2 displayed. A 
separate set of 40 genes contributing to correlation between diseases in each class are also enriched for larger loadings. 
doi:10.1371/journal.pcbi.1003820.g001 



We applied disPCA to a final set of 31 datasets (Table S2), 
including autoimmune diseases, cancers, obesity-related diseases 
and traits, psychiatric disorders and neurological disorders. The 
first two PCs capture visually-interpretable separation of diseases. 
PC 1 for the most part splits the two systemic lupus erythematosus 
(SLE) and the one dataset of celiac disease from all other datasets 
(Figure 4). Independent of that separation, PC2 splits autoimmune 
diseases (in purple) from other diseases, and within autoimmune 
diseases, inflammatory bowel disorders (Crohn's disease and 
ulcerative colitis) are clustered together (Figures 4—5). Schizophre- 
nia, major depressive disorder, cancers, T2D and neurological 
disorders lie on the negative end of PC2, while attention deficit 
hyperactivity disorder (ADHD), and some autoimmune diseases 
that are not well separated on this PC from other diseases, lie near 
the origin. PCs beyond the first two explain almost the same 
fraction of the variance (Figure 4b) and hence merit further 
investigation (see Discussion). 

As disPCA teases out the important genes of shared and distinct 
pathogenetics across disease datasets, we next investigated which 
genes strongly contribute to each PC based on their absolute 
loadings. Specifically, we retrieved the genes with the top 50 
absolute loadings for each of the top two PCs underlying Figure 4 
and tested their disease enrichment (Materials and Methods). The 
top genes underlying the first PC were significandy enriched for 
genes associated with lupus and autoimmune related diseases, 
while genes underlying the second PC were mostly enriched for 
association to IBD (Table 1). These enrichment results are 



consistent with the separation of studies across each of these 2 
PCs with PCI mosdy separating studies of SLE and celiac diseases, 
and PC 2 mostly separating studies of IBD from other diseases. 
The results were largely unchanged following filtering genes that 
were within 0.1 cM of each other to account for linkage 
disequilibrium and for similar genes being co-located to each 
other, such as gene families (Table 1) (Materials and Methods). 

Though the results of the disease enrichment analysis support 
that disPCA extracts biologically relevant signals, the arbitrary 
cutoff of the 50 top genes goes against the potential of PCs being 
linear combinations of all genes. We thus used GSEA [46], which 
supports analyzing a pre-ranked list of all genes, to perform 
pathway enrichment of each PC. GSEA assesses whether genes 
belonging to a certain pathway are non-randomly distributed in 
the list of pre-ranked genes. We ranked all genes by the absolute 
loading in the PC under study. Results of this pathway analysis 
revealed enrichment for immune related pathways on the first 2 
PCs (Table 2) at an FDR of 0.25. The top two pathways enriched 
on PCI were the antigen processing and presentation and the 
intestinal immune network IgA production pathways, which are 
crucial immune-related pathways. In particular, intestinal IgA 
antibodies may have a role in celiac disease [80] and inflammatory 
bowel disease [81,82]. On PC2, the most significant pathway was 
the NOD-like receptor signaling pathway. NOD-like receptors 
have been associated to Crohn's disease, while other immune- 
related genes likely interacting with NOD2 have been associated 
to ulcerative colitis [83]. Other immune system pathways were 
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Figure 2. disPCA of datasets of the same disease. A) Pairs of datasets of the same autoimmune diseases and schizophrenia are displayed on PCI 
and PC2. Dataset labels are indicated in the form of disease-type_ study-name. The size of points is proportional to the sample size of the original 
study (Table S2). Diseases include systemic lupus erythematosus (SLE), vitiligo (Vit), multiple sclerosis (MS), schizophrenia (Schizo) and Crohn's disease 
(CD). Datasets of the same diseases tend to lie closer together on PC1 and PC2. B) The portion of variance explained by each PC is displayed. C) The 
weightings for genes on PC1 are displayed and ordered according to their weights. D) Similar to (C) with loadings for PC2. 
doi:10.1371/journal.pcbi.1003820.g002 



Figure 3. Clustering dendrogram of datasets of the same 
disease. Each pair of datasets of the same disease cluster together 
based on hierarchical clustering applied to the Euclidean distance 
between datasets in the first two PCs presented in Figure 2 (Materials 
and Methods). The height of the branches represents the Euclidean 
distance between datasets in the space of the first two PCs. 
doi:1 0.1 371 /journal.pcbi.1 003820.g003 



enriched, including the Fc epsilon RI signaling pathway that is 
related to the antibody IgE, which induces inflammatory response 
[84]. Two enriched pathways are related to neurons (i.e. the 
neurotrophin signaling pathway and the Trk-A pathway). In 
particular, the neurotrophic factor BDNF (brain-derived neuro- 
trophic factor), which is a part of the neurotrophin pathway, has 
been previously associated to Alzheimer's, Parkinson's disease and 
depression [85-87]. More recently, an intronic variant in this gene 
has also been associated to BMI [88]. The contribution of genes in 
these pathways to PC2 may explain the separation of neurological, 
psychiatric and BMI studies along that PC. As above, we reran 
GSEA after filtering genes that were within 0. 1 cM of each other 
(Materials and Methods). The top two pathways on the first PC 
remained significant, while only the top pathway in PC 2 remained 
significant (Table S4). This is likely due to the contribution to 
enrichment of several genes that are co-located, which should 
hence not necessarily be discounted. 

Many autoimmune diseases share associations from the HLA 
region. We thus reran disPCA after removing all genes in and 
around the HLA region, and found a slightly different visual PCA 
map (Figure 6). SLE and celiac disease were no longer 
distinguished from other autoimmune diseases and instead lay 
near the origin. PCI now differentiated IBD from other diseases, 
and PC 2 separated some autoimmune diseases from the rest on 
one extreme, and schizophrenia from the rest on the other. This 
was further supported by clustering results on the first two PCs 
(Figure S10). A GSEA analysis of the PC loadings retained the 
NOD-like receptor signaling pathway on PCI instead of PC2 
(Table 3). Analysis of PC 2 loadings revealed additional immune 
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Figure 4. disPCA of all diseases and traits. A) Autoimmune diseases (purple), cancers (pink), psychiatric disorders (yellow), neurological disorders 
(green), and other diseases and traits (grey) are shown on PC1 and PC2. PCI accounts for 4.48% of the variance, while PC2 accounts for 4.21%. 
Additional diseases include Alzheimer's disease (Alz), amyotrophic lateral sclerosis (ALS), ankyolosing spondylitis (AS), attention deficit hyperactivity 
disorder (ADHD), Behcet's disease (Behcets), body mass index (BMI), breast cancer (BreastC), celiac disease (CeliacD), ischemic stroke (IscStroke), major 
depression (MajDep), Parkinson's disease (Parkin), prostate cancer (ProstateC), psoriasis (Psor), rheumatoid arthritis (RA), type-1 diabetes (T1 D), type-2 
diabetes (T2D), ulcerative colitis (UC). PC1 clusters celiac disease and SLE together, while PC2 separates inflammatory bowel diseases from other 
diseases and traits. B) The portion of variance explained by each PC is displayed. Three additional PCs explain 0% of the variance corresponding to the 
number of confounders we accounted for (Materials and Methods). C) The weightings for genes on PC1 are displayed and ordered according to their 
weights. D) Similar to (C) where loadings are for PC2. 
doi:1 0.1 371 /journal.pcbi.1 003820.g004 



related pathways that were not enriched in our previous analysis 
that included the HLA region. 

Results such as PC 1 in the main analysis clustering schizophrenia 
close to some autoimmune diseases (Figure 4) prompted us to further 
explore the shared pathogenetics between diseases by testing for the 
non-random distribution of gene-based p-values in one disease based 
on their nominal significance in another disease (Materials and 
Methods). Generally, the results show that association statistics are 
non-randomly distributed when considering most pairs of autoim- 
mune diseases, i.e. testing for non-random distribution in one 
autoimmune disease dataset based on significance in another 
autoimmune disease dataset (Figure 7). As a control, we tested for 
non-random distribution for a random set of genes and found that no 
disease pair was significant for non-random distribution (Figure S 1 1). 
Our results reported a similar story as observed via disPCA. Genes 
nominally significant in rheumatoid arthritis, type-1 diabetes and 
ankyolosing spondylitis were non-randomly distributed in SLE and 
vice versa. We also found that genes nominally significant for one 
schizophrenia study were non-randomly distributed in a number of 
autoimmune diseases (Figure 7). These signals remained even after 
genes within 0.1 cM of another gene were removed (Figure SI 2) 
(Materials and Methods). 

Discussion 

In this study we introduced a new method, disPCA, to explore 
the shared pathogenetics of various diseases and disease classes 



based on GWAS data. PCA has been widely used in population 
and medical genetics. Applied to genome- wide genotyping data, it 
can recapitulate population structure such as revealing European 
geography [89], has been used as a tool to assess and correct for 
population stratification in GWAS [78,90] and has also been 
proposed as a tool for reducing the dimensionality of multiple 
phenotypes for association analysis [91]. Our disPCA method 
considers PCA on a different type of matrix, whereby different 
GWASs are studied in the space of all genes. It can group GWASs 
of different diseases together based on gene-level association 
statistics, while accounting for biases due to heterogeneity in 
sample size, association method, genotyping array and other 
confounders between studies. This implementation of PCA assigns 
weights to each gene and each PC in a manner that maximizes the 
variation between diseases. Hence, the higher the level of shared 
pathogenetics between diseases, the closer they will be in PC space. 
This is in contrast to methods that considered the correlation 
between diseases across all SNPs [29]. In fact, when we consider 
such correlations in our data, it is generally very low, even when 
considering it on the gene rather than on the SNP-level and even 
when the same disease is studied. For example, the correlation 
coefficient between the —log 10 p-values of the two Crohn's disease 
studies is 0.048, and it is 0.063 and 0.031 between ulcerative colitis 
and each of the two Crohn's disease studies. More generally, the 
highest correlation between pairs of datasets of the same disease 
was obtained for schizophrenia (0. 1 3, p-value = 2.2 x 10" 16 ) while 
the lowest was obtained for type-2 diabetes (0.0031, p-val- 
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Figure 5. Clustering dendrogram of datasets of all diseases and traits. Dendrogram derived from hierarchical clustering analysis applied to 
distance (in PC space) between datasets presented in Figure 4. Inflammatory bowel diseases are clustered together, in addition to SLE and celiac 
disease. 

doi:10.1371/journal.pcbi.1003820.g005 



ue = 0.73). These results show that there is less power when 
aggregating information across all genes and that disPCA is able to 
tease out and weigh the suitable set of genes underlying shared 
pathogenetics. 

Though disPCA is designed to uncover shared disease etiology 
between diseases, other sources of correlation between datasets can 
also contribute to its results. Potential confounders include shared 
samples between datasets, technical artifacts, and population 
structure (if risk factors vary across ancestry). We accounted for 
technical artifacts introduced by the genotyping array, association 
method and sample size by regressing out variation in the data 
attributed to these sources (Materials and Methods). To minimize 
the impact of population structure and shared samples, we only 
applied disPCA to studies of individuals of European ancestry and 



datasets that had no overlapping case or control data. Though we 
cannot account for other potential confounders that are unknown, 
our results strongly suggest that the remaining correlation between 
studies represent shared disease etiology. 

We applied disPCA to data from 3 1 GWASs that cover a range 
of diseases in four main classes: autoimmune diseases, cancers, 
neurological disorders and psychiatric disorders. We additionally 
analyzed GWASs on T2D, BMI and ischemic stroke. We first 
observed that different studies of the same diseases tend to lie 
closer together on the lead PCs (Figure 2). This is in support of 
studies of the same disease replicating many of the same signals of 
associations when samples are of similar ancestry. We additionally 
find that disPCA positions diseases within the same class closer 
together (Figure 4). This was especially the case for the major types 
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Table 1. Disease enrichment analysis for disPCA (Figure 1). 





PC 


Disease 


P-value* 


P-value (distance pruned)* 


1 


Lupus erythematosus 


1.59x1(T 6 


3.0x10~ 8 




Arthritis 


1.72x1(T 6 


>0.01 




Connective tissue diseases 


5.00x1(T 4 


>0.01 




Autoimmune diseases 


2.6 x1(T 3 


2.05 x10~ 6 




Rheumatic Diseases 


2.6 x1(T 3 


>0.01 




Immune system diseases 


6.5x1(T 3 


2.2x10~ 5 


2 


Gastroenteritis 


5.79x1(T 13 


2.92x10 -9 




Crohn's Disease 


2.12x1(T 12 


1.73x10~ 8 




Inflammatory bowel diseases 


1.65XKT 11 


7.53x10~ 8 




Fistula 


4.00 x1(T 9 


1.37x10~ 7 




Gastrointestinal diseases 


3.49x10~ 8 


7.16x10~ 8 




Celiac disease 


2.75x10~ 5 


7.8x10~ 6 




Multiple sclerosis 


2x10" 3 


7x10~ 4 




Skin diseases, genetic 


2.3 x10~ 3 


8.1 x10~ 3 




Rheumatic diseases 


6.4 x10~ 3 


2.3x10~ 3 




Autoimune diseases 


9.6x10~ 3 


2.7x10~ 3 



•Bonferroni adjusted for multiple testing. 

Table shows disease enrichment results for all diseases significantly enriched with an adjusted p-value<0.01 . The distance pruned p-values refers to disease enrichment 
after removing the latter out of a pair of genes that were within 0.1 cM of each other. 
doi:1 0.1 371 /journal.pcbi.1 003820.t001 



of IBDs (i.e. Crohn's disease and ulcerative colitis), which clustered 
close together (Figure 5). This points to distinct etiology shared 
between IBDs, that is not shared between IBDs and most other 
autoimmune diseases. Indeed, it has recently been suggested that 
IBD is at least in part a primary immunodeficiency disorder 
[92,93]. Between the different disease classes, the main 2 PCs in 
disPCA found overlap between non-autoimmune diseases and 
traits, as well as pointed to a potential connection between 
schizophrenia and some autoimmune diseases. 



Using the weightings of genes on each of the leading PCs, we 
performed disease and pathway enrichment analysis. We found 
that PCI, which mainly splits some autoimmune disorders from 
other autoimmune disorders, is significantly enriched for genes 
associated to immune and autoimmune disorders. PC2, which 
splits IBD studies from studies of other diseases, is significandy 
enriched for genes in some inflammatory related pathways and 
genes associated with IBD. Further results in PC2 highlighted 
neuron-related pathways that can be in line with evidence that 



Table 2. Gene enrichment analysis for disPCA. 



PC 


Pathway 


FDR (q-value) 


1 


Antigen processing and presentation 


0.034 




Intestinal immune network for IgA production 


0.042 




Trk-A pathway 


0.169 




CK1 pathway 


0.213 




DREAM pathway 


0.228 




Valine leucine and isoleucine biosynthesis 


0.228 




O-glycan biosynthesis 


0.243 




Folate biosynthesis 


0.246 


2 


NOD-like receptor signaling pathway 


<1 xlO" 4 




Intestinal immune network for IgA production 


0.074 




Neurotrophin signaling pathway 


0.165 




Chemokine signaling pathway 


0.195 




Fc epsilon Rl signaling pathway 


0.232 




Terpenoid backbone biosynthesis 


0.232 




JAK-STAT signaling pathway 


0.238 


Table shows pathways that are enriched 


in the disPCA analysis based on the GSEA analysis. 





doi:1 0.1 371 /journal.pcbi.1 003820.t002 
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Figure 6. disPCA of all diseases and traits excluding the HLA and surrounding region. A) Similar to Figure 4 where genes in the MLA and 

surrounding region were removed. Though IBD remains separated as in the original disPCA, the clustering of celiac disease and SLE is no longer 
captured by the top two PC's. B) The portion of variance explained by each PC is displayed. C) The weightings for genes on PCI are displayed and 
ordered according to their weights. D) Similar to (C) where loadings are for PC2. 
doi:1 0.1 371 /journal.pcbi.1 003820.g006 



abnormal neurotrophins levels in the brain have been associated to 
schizophrenia [94,95]. Excluding the HLA region revealed 
significant enrichment for genes in other immune-related path- 
ways. Though the specific analysis presented in this paper focused 
on the top two PCs, further PCs estimated by disPCA can be 
examined. For example, PC4 of disPCA on all GWASs 
distinguishes rheumatoid arthritis from other diseases (Figure 
SI 3). Pathway enrichment analysis highlighted the calcineurin 
pathway (FDR = 0.1 82), which is involved in T-cell activation. 
Additionally, though schizophrenia and vitiligo datasets are 
further apart on the first two PCs, each pair of datasets is 
clustered closer together on PC3 and PC4. Altogether, these 
results support the validity of the enrichment analysis based on 
disPCA . The analysis in turn also raises new hypotheses of disease 



etiology by pointing to additional pathways and enrichment for 
other diseases that were not previously observed. 

Prompted by the results of disPCA, we further explored shared 
pathogenetics by testing for the non-random distribution of 
association statistics between pairs of disease studies (Figure 7). 
Autoimmune diseases show non-random distribution of associa- 
tion statistics with one another. Interestingly, genes nominally 
associated with one of the schizophrenia studies were non- 
randomly distributed in studies of several autoimmune diseases 
(i.e. ankyolosing spondylitis, systemic lupus erythematosus, and 
T1D), in support of the above disPCA results. Interestingly, this 
relationship was only observed for one of the two schizophrenia 
studies we analyzed, which may be due to a number of factors, 
including high number of risk factors for schizophrenia, with 



Table 3. Gene enrichment analysis for disPCA without the HLA region. 



PC 


Pathway 


FDR q-value 


1 


NOD-like receptor signaling pathway 


0.006 




Local acute inflammatory response pathway 


0.143 


2 


Proteasome pathway 


0.077 




Th1-Th2 pathway 


0.102 




Proximal tubule bicarbonate reclamation 


0.135 




Adherens junction 


0.142 




RNA polymerase 


0.171 




CTLA-4 pathway 


0.173 


Table shows pathways that are enriched 


in the disPCA analysis based on the GSEA analysis after removing genes 


n the HLA and surrounding region. 



doi:1 0.1 371 /journal.pcbi.1 003820.t003 
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Figure 7. Non-random distribution of genes for all analyzed datasets from Figure 4. Genes nominally significant for diseases on the y-axis 
were tested for non-random distribution in diseases on the x-axis (Materials and Methods), with -log q0 presented on the color scale on the right. 
White entries denote p-values<1 x10~ 17 . The most significant results are for pairs of similar diseases and between pairs of autoimmune diseases. In 
addition, pairs between some autoimmune diseases and schizophrenia also display significant results. 
doi:10.1371/journal.pcbi.1003820.g007 



different ones being associated in different studies. If indeed 
autoimmune diseases and schizophrenia share disease etiology, 
then just as one would not include individuals with ulcerative 
colitis as controls for a Crohn's disease GWAS since they both are 
IBDs, one should also be wary of including individuals with 
autoimmune disorders in a schizophrenia GWAS (and vice versa) 
as doing so may decrease power in loci implicated in both diseases. 
Lack of power due to such or other reasons might also underlie our 
lack of observation of significant shared etiology between the 
second schizophrenia dataset and autoimmune diseases. 

Finally, we make a few recommendations for future applications 
of disPCA to additional studies: (1) Biases can be introduced when 
studies share sample data; (2) As disPCA maximizes variance 
across diseases, genes that are implicated in all analyzed diseases 
will not contribute to the lead PC as they do not distinguish 
diseases from each other; (3) While here we only focused on using 
the strength of association and on gene-level signals, the method 
itself is highly flexible. One can further utilize the direction of 
association (protective versus deleterious), the heritability at each 
locus [96], an analysis at the pathway-level or in linkage 
disequilibrium blocks, include other non-genic functional ele- 
ments, and/ or environmental risk factors; (4) disPCA can be used 
to generate new hypotheses, which can then be tested by 
conducting more focused association studies in independent data 
or by using its output to better combine different diseases in an 
independent meta-analysis. New hypotheses can also be generated 
with regard to the genes that contribute to comorbidity between 
diseases. In conclusion, disPCA offers users a unique general 



overview of the disease landscape by studying their distinct and 
shared pathogenetics and flagging pathways and genes for further 
investigation. disPCA\ flexibility and computational efficiency 
proves itself as an excellent tool to be applied to additional diseases 
and disease classes to further our knowledge of shared pathoge- 
netics. 

Supporting Information 

Figure SI Clustering dendrogram of datasets of the 
same diseases using physical distance mapping. SNPs 
were mapped to genes if they were within 10 kb of the gene. 
Clustering analysis of resulting disPCA revealed the same clusters 
as disPCA with genetic coordinates (Figure 3). 
(TIFF) 

Figure S2 Clustering dendrogram of datasets of the 
same diseases with the truncated product method. 

Similar to Figure 3, with the truncated product method used to 

combine SNP p-values per gene. 

(TIFF) 

Figure S3 Clustering dendrogram of datasets of the 
same diseases with truncated tail strength method. 

Similar to Figure 3, with the truncated tail strength method used 

to combine SNP p-values per gene. 

(TIFF) 

Figure S4 Simulated diseases with ten nominally sig- 
nificant genes. A) Similar to Figure 1 in main text with only ten 
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nominally significant genes for each set of pleiotropic diseases 
(Materials and Methods). Clustering of the diseases sets is not 
observed. B) Clustering dendrogram as similarly presented in 
Figure lb. C) The portion of variance explained by each PC is 
displayed. D-E) The loadings for PCI and PC2 are displayed. 
(TIFF) 

Figure S5 Simulated diseases with twenty nominally 
significant genes. A) Similar to Figure 1 with twenty nominally 
significant genes for each set of pleiotropic diseases. As in Figure 
S2, diseases are not clearly clustering according to the sets though 
nominally significant genes are enriched for larger absolute 
loadings (Materials and Methods). B) Clustering dendrogram as 
similarly presented in Figure lb. C) The portion of variance 
explained by each PC is displayed. D-E) The loadings for PCI 
and PC 2 are displayed. 
(TIFF) 

Figure S6 Simulated diseases with thirty nominally 
significant genes. A) Similar to Figure 1 with thirty nominally 
significant genes for each set of pleiotropic diseases. The proper 
clustering of diseases is beginning to emerge. B) Clustering 
dendrogram as similarly presented in Figure lb. C) The portion 
of variance explained by each PC is displayed. D-E) The loadings 
for PCI and PC 2 are displayed. Genes with nominally significant 
p-values are enriched for larger absolute loadings. 
(TIFF) 

Figure S7 Simulated diseases with 100 and 200 nomi- 
nally significant genes. A) Similar to Figure 1 with 100 and 
200 nominally significant genes for the two sets of pleiotropic 
diseases. Disease sets are tightly clustered and the first two PCs 
explain a larger portion of the variance compared to other PCs. B) 
Clustering dendrogram as similarly presented in Figure lb. C) 
The portion of variance explained by each PC is displayed. D-E) 
The loadings for PCI and PC 2 are displayed. 
(TIFF) 

Figure S8 Clustering dendrogram of Replication Set 1 
datasets. Clustering of the distance in PC space between datasets 
in Replication Set 1. Diseases include vitiligo (Vit), multiple 
sclerosis (MS), schizophrenia (Schizo) and Crohn's disease (CD). 
(TIFF) 

Figure S9 Clustering dendrogram of Replication Set 2 
datasets. Similar to Figure S8 with datasets from Replication Set 
2. 

(TIFF) 

Figure S10 Clustering dendrogram of all diseases and 
traits excluding the HLA and surrounding regions. Figure is 
similar to Figure 5, with clustering analysis of distance between 
datasets based on the disPCA between all diseases and traits presented 
in Table S2 after removing the HLA and surrounding regions. 
(TIFF) 

Figure Sll Non-random distribution of randomly cho- 
sen genes. A random subset of genes were chosen to be tested for 
non-random distribution in diseases on the x-axis, with — logio 
presented on the color scale on the right. White entries denote p- 
values<l xlO" 17 . 
(TIFF) 

Figure S12 Non-random distribution for distance 
pruned set of genes. Genes were filtered such that no two 



genes were within 0. 1 cM of another. The remaining subset of 
genes was then tested for non-random distribution in diseases on 
the x-axis. The —login of the p-value is presented on the color 
scale and white entries denote p-values< 1 x 10 -17 . Results are 
largely similar to the original without filtering of nearby genes. 
(TIFF) 

Figure S13 PC3 and PC 4 of all diseases disPCA. Similar to 
Figure 4 with data being presented for PC3 and PC4. A) PCI 
accounts for 4.18% of the variance, while PC2 accounts for 
4.08%. PCI clusters schizophrenia and vitiligo datasets together 
on the two extremes, while PC 2 separates rheumatoid arthritis 
from other diseases and traits. B) The portion of variance 
explained by each PC is displayed. C) The weightings for genes 
on PC 1 are displayed and ordered according to their weights. D) 
Similar to (C) where loadings are for PC 2. 
(TIFF) 

Table SI Comparison of loadings between disPCA with 
mapping based on physical or genetic coordinates. 

Loadings for the top 50 genes ranked by either a physical or 
genetic coordinates based disPCA were compared. 'Correlation' 
denotes the Pearson's correlation coefficient with its significance 
denoted in the 'p-value' column. Rows denoted by 
'mean(PCl,PC2)' indicate the correlation between the 50 genes 
with the largest average loading of PCI and PC 2. 
(DOC) 

Table S2 Dataset attributes. Various attributes of datasets 

utilized in this study. 

(DOC) 

Table S3 Comparison of loadings between Replication 
Sets 1 and 2. Loadings for the top 50 genes ranked by either 
Replication Set 1 or Replication Set 2 were compared. 
'Correlation' denotes the Pearson's correlation coefficient with 
its significance denoted in the 'p-value' column. Rows denoted by 
'mean(PCl,PC2)' indicate the correlation between the 50 genes 
with the largest average loading of PCI and PC 2. 
(DOC) 

Table S4 Pathway enrichment after filtering nearby 
genes. Pathway enrichment was applied to a subset of genes that 
were located greater than 0.1 cM from each other. 
(DOC) 
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